Maciej Dobrzynski (Intitute of Cell Biology, University of Bern)
March 19, 2020
The aim of this workshop is to process raw time-series data from time-lapse microscopy experiment. We will:
Intermediate data sets throughout the workshop:
data.table
Extension of base R's data.frame stracture.
Fast data manipulation with a concise SQL-like syntax.
Check out the vignette for an introduction.
Packages are R's greatest strength but may create confusion.
CRAN = The Comprehensive R Archive Network, it is a package repository that currewntly features 15'359 packages.
Aside from an obligatory reference manual, many packages include vignettes – digestible intros into working with a package.
An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Brief, visual sumamries of packages' functionality.
Divide your work into multiple contexts with RStudio projects.
Each project has its own working directory, workspace, history, and source documents.
You can create an RStudio project:
File > New Project… to create a new R project
Use # to add comments to your code.
Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) creates a section in the code.
Navigate code sections, using Jump To menu at the bottom of the editor.
The outline of code sections is in the upper right corner of the editor window.
RStudio supports folding for code regions (small downward triangles next to line numbers on the left in the editor window).
To indent or reformat the code use:
Stick to a single naming convention throughout the code. A convenient convention is a so-called camel notation, where names of variables, constants, functions are constructed by capitalizing each comound of the name, e.g.:
calcStatsOfDF - function to calculate statsnIteration - prefix n to indicate an integer variablefPatientAge - f to indicate a float variablesPatientName - s to indicate a string variablevFileNames - v for vectorlColumnNames - l for a listIf a variable is available/valid only within a function, prefix the name with loc, e.g.:
locNiterlocVfileNamesAn R Notebook is a document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.
The text is formatted using R Markdown.
Use R Notebooks to create publish-ready documents with text, graphics, and interactive plots.
Can be exported to html, pdf, or a Word document.
Can also include interactive elements!
Data.table package is a faster, more efficient framework for data manipulation compared to R's default data.frame. It provides a consistent syntax for subsetting, grouping, updating, merging, etc.
Let's define a data table:
require(data.table)
dtBabies = data.table(name= c("Jackson", "Emma", "Liam", "Ava"),
gender = c("M", "F", "M", "F"),
year2011= c(74.69, NA, 88.24, 81.77),
year2012=c(84.99, NA, NA, 96.45),
year2013=c(91.73, 75.74, 101.83, NA),
year2014=c(95.32, 82.49, 108.23, NA),
year2015=c(107.12, 93.73, 119.01, 105.65))
dtBabies
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Emma F NA NA 75.74 82.49 93.73
3: Liam M 88.24 NA 101.83 108.23 119.01
4: Ava F 81.77 96.45 NA NA 105.65
The general form of data.table syntax is as follows:
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
The way to read it (out loud) is: take DT, subset/reorder rows using i, then calculate j, grouped by by.
Select specific records:
dtBabies[gender == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
dtBabies[, .(name, gender, year2015)]
name gender year2015
1: Jackson M 107.12
2: Emma F 93.73
3: Liam M 119.01
4: Ava F 105.65
Calculate the mean of a single column:
dtBabies[,
.(meanWeight = mean(year2015))]
meanWeight
1: 106.3775
Calculate the mean of a single column by gender:
dtBabies[,
.(meanWeight = mean(year2015)),
by = gender]
gender meanWeight
1: M 113.065
2: F 99.690
Hardcoding column names in the script is potentially dangerous, for example when for some reason column names change.
Better solution: store column names at the beginning of the script, where it's easy to change them. Then use variables with those names in the code.
lCol = list(meas = 'weight',
time = 'year',
group = c('name', 'gender'),
timeLast = 'year2015')
lCol
$meas
[1] "weight"
$time
[1] "year"
$group
[1] "name" "gender"
$timeLast
[1] "year2015"
Select specific records:
dtBabies[get(lCol$group[[2]]) == 'M']
name gender year2011 year2012 year2013 year2014 year2015
1: Jackson M 74.69 84.99 91.73 95.32 107.12
2: Liam M 88.24 NA 101.83 108.23 119.01
Select specific columns:
myColumns = c(lCol$group[[1]], lCol$timeLast)
dtBabies[, ..myColumns]
name year2015
1: Jackson 107.12
2: Emma 93.73
3: Liam 119.01
4: Ava 105.65
The same summary but with column names stored as strings in elements of the list lCol.
The j part of the data.table requires us to use a function get to interpret the string as the column name.
dtBabies[,
.(meanWeight = mean(get(lCol$timeLast))),
by = c(lCol$group[2])]
gender meanWeight
1: M 113.065
2: F 99.690
Our data table is in the wide format. To convert it to long format, use the function melt.
Provide the names of identification (id.vars) and measure variables (measure.vars). If none are provided, melt guesses them automatically, which may result in a wrong conversion. Both variables can be given as strings with column names, or as column numbers.
The original data frame contains missing values; na.rm=T omits them in the long-format table.
dtBabiesLong = data.table::melt(dtBabies,
id.vars = c('name', 'gender'), measure.vars = 3:7,
variable.name = 'year', value.name = 'weight',
na.rm = T)
head(dtBabiesLong, n = 5L)
name gender year weight
1: Jackson M year2011 74.69
2: Liam M year2011 88.24
3: Ava F year2011 81.77
4: Jackson M year2012 84.99
5: Ava F year2012 96.45
The function dcast converts from long to wide format. The function has a so called formula interface that specifies a combination of variables that uniquely identify a row.
Note that because some combinations of name + gender + year do not exist, the dcast function will introduce NAs.
dtBabiesWide = data.table::dcast(dtBabiesLong,
name + gender ~ year,
value.var = 'weight')
dtBabiesWide
name gender year2011 year2012 year2013 year2014 year2015
1: Ava F 81.77 96.45 NA NA 105.65
2: Emma F NA NA 75.74 82.49 93.73
3: Jackson M 74.69 84.99 91.73 95.32 107.12
4: Liam M 88.24 NA 101.83 108.23 119.01
There are two ways to use the formula interface with string variables:
Create formula string explicitly
as.formula(
paste0(
lCol$group[[1]], "+", lCol$group[[2]],
"~", lCol$time))
name + gender ~ year
Use reformulate function. The limitation being a single response variable only!
reformulate(response = c(lCol$group[[1]], lCol$group[[2]]),
termlabels = lCol$time)
name ~ year
fread
Fast reading of the files; use nThread option to take advantage of multiple threads and read files even faster!
require(data.table)
require(R.utils) # read gz files directly
mydt = fread(file = "inFile.csv.gz",
nThread = 4)
fwrite
for fast writing; also compressed files (gz and bz2).
require(data.table)
require(R.utils) # gzip files
myFilePath = "outFile.csv"
fwrite(x = mydt,
file = myFilePath,
row.names = F)
# compresses; overwrites any pre-existing files
gzip(myFilePath, overwrite = T)
The ggPlot package is a powerfull plotting package that requires data in the long format.
Let's plot weight over time.
require(ggplot2)
p1 = ggplot2::ggplot(dtBabiesLong,
aes(x = year,
y = weight)) +
geom_line()
Oops, the plotting function doesn't know how to link the points. The logical way to link them is by name column.
Here we add group option in the ggplot aesthetics (aes) to avoid the mistake from the above.
Also, to avoid hard-coding column names we use aes_string instead of aes.
The data will be plotted as lines, with additional dots to indicate data points
ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point()
In order to produce facets per gender, we use function facet_wrap.
It uses formula interface, same as in the case of dcast. Again, we need to build the formula from a string.
sFormula = paste0('~', lCol$group[2])
sFormula
[1] "~gender"
ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
facet_wrap(lCol$group[2])
p1 = ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
facet_wrap(sFormula) +
stat_summary(fun.y = mean,
aes(group=1),
geom = "line",
colour = 'red')
The group=1 overrides the by-name grouping so that we get a mean across all names for each gender, rather than the mean of each individual name within each gender (which would be the same as individual observations).
p1 = ggplot2::ggplot(dtBabiesLong,
aes_string(x = lCol$time,
y = lCol$meas,
group = lCol$group[1])) +
geom_line() +
geom_point() +
stat_summary(fun.y = mean,
aes(group=1),
geom = "line",
colour = 'red') +
facet_wrap(sFormula) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1))
Making an interactive plot from a ggplot object is extremely easy. Just use ggplotly function from the amazing plotly package. The interactive plot will remain in the html document knitted from the R notebook.
require(plotly)
plotly::ggplotly(p1)
Let's write a function to calculate statistics of a data frame. In the simplest case the function will calculate the mean of a single column of a data frame.
We will expand the funciton to calculate the mean by a group, and to calculate robust statistics, i.e. median instead of the mean.
The function will need the following input parameters:
calcStats = function(inDt,
inMeasVar,
inGroupName = NULL,
inRobust = F) {
if (inRobust) {
outDt = inDt[,
.(medianMeas = median(get(inMeasVar))),
by = inGroupName]
} else {
outDt = inDt[,
.(meanMeas = mean(get(inMeasVar))),
by = inGroupName]
}
return(outDt)
}
Since column names will be provided to our function as string parameters, we cannot hard-code them inside of the function. Therefore, we use function get to use the string stored in the variable inMeasVar as the column name.
Calculate the mean of the weight column:
calcStats(dtBabiesLong, 'weight')
meanMeas
1: 93.79933
Calculate the mean of the weight column by name and gender. Use robust stats:
calcStats(dtBabiesLong, 'weight', inGroupName = c('name', 'gender'), inRobust = T)
name gender medianMeas
1: Jackson M 91.73
2: Liam M 105.03
3: Ava F 96.45
4: Emma F 82.49
Once inside the function, click Menu > Code > Insert Roxygen Skeleton (Shift-Option-Command R). A pecial type of comment will be added above the function. You can add your text next to parameters.
#' Calculates stats of a data frame
#'
#' @param inDt Input data table in the long format
#' @param inMeasVar Name of the measurement column
#' @param inGroupName Name of the grouping column (default NULL)
#' @param inRobust If true, the function calculates median instead of the mean (default False)
#'
#' @return Data table with summary stats
#' @export
#' @import data.table
#'
#' @examples
#' # example usasge
calcStats = function(inDt, inMeasVar, inGroupName = NULL, inRobust = F) {
if (inRobust) {
outDt = inDt[, .(medianMeas = median(get(inMeasVar))), by = inGroupName]
} else {
outDt = inDt[, .(meanMeas = mean(get(inMeasVar))), by = inGroupName]
}
return(outDt)
}
From R-bloggers:
A FOR loop is the most intuitive way to apply an operation to a series by looping through each item one by one, which makes perfect sense logically but should be avoided by useRs given the low efficiency.
As an aletrnative, use apply function family, e.g. lapply.
foreach